Abstract

Context tree models were introduced by Rissanen in [25] as a parsimonious generalization
of Markov models. Since then, they have been widely used in applied probability and statistics.
The present paper investigates non-asymptotic properties of two popular procedures of context tree
estimation: Rissanen's algorithm Context and penalized maximum likelihood. First showing how
they are related, we prove finite horizon bounds for the probability of over- and under-estimation.
Concerning over-estimation, no boundedness or loss-of-memory conditions are required: the proof
relies on new deviation inequalities for empirical probabilities of independent interest. The
under-estimation properties rely on classical hypotheses for processes of infinite memory. These results
improve on and generalize the bounds obtained in [12], [18], [17], [22], refining asymptotic results

of Hi. 
1. Introduction

Context tree models (CTM), first introduced by Jorma Rissanen in [25] as efficient tools in
Information Theory, have been successfully studied and used since then in many fields of Probability
and Statistics, including Bioinformatics [Hill], Universal Coding [27], Mathematical Statistics [j]
or Linguistics [la]. Sometimes also called Variable Length Markov Chain (VLMC), a context tree
process is informally defined as a Markov chain whose memory length depends on past symbols. 
This property makes it possible to represent the set of memory sequences as a tree, called the 
context tree of the process. 

A remarkable tradeoff between expressivity and simplicity explains this success: no more dif- 
ficult to handle than Markov chains, they appear to be much more flexible and parsimonious, 
including memory only where necessary. Not only do they provide more efficient models for fitting 
the data: it appears also that, in many applications, the shape of the context tree has a natural 
and informative interpretation. In Bioinformatics, the context trees of a sample have been useful
to test the relevance of protein families databases Q and in Linguistics, tree estimation highlights
structural discrepancies between Brazilian and European Portuguese [la].

Of course, practical use of CTM requires the possibility of constructing efficient estimators
of the model $T_0$ generating the data. It could be feared that, as a counterpart of the model
multiplicity, increased difficulty would be encountered in model selection. Actually, this is not
the case: several procedures were soon proposed and proved to be consistent. Roughly
speaking, two families of context tree estimators are available. The first family, derived from the
so-called algorithm Context introduced by Rissanen in [25], is based on the idea of tree pruning.
These estimators are somewhat reminiscent of the CART Q pruning procedures: a measure of discrepancy
between a node's children determines whether they have to be removed from the tree or not. The
second family of estimators is based on a classical approach of mathematical statistics: Penalized
Maximum Likelihood (PML). For each possible model, a criterion is computed which balances the
quality of fit and the complexity of the model. In the framework of Information Theory, these
procedures can be interpreted as derivations of the Minimum Description Length principle.

In the case of bounded memory processes, the problem of consistent estimation is clear: an
estimator $\hat{T}$ is strongly consistent if it is equal to $T_0$ eventually almost surely as the sample size
grows to infinity. As early as 1983, Rissanen proved consistency results for the algorithm Context
in this case. Later, the possibility of handling infinite memory processes was also addressed.
In that setting, an estimator $\hat{T}$ is called strongly consistent if for every positive integer $K$, its truncation
$\hat{T}|_K$ at level $K$ is equal to the truncation $T_0|_K$ of $T_0$ eventually almost surely. With this definition,
PML estimators are shown to be strongly consistent if the penalties are appropriately chosen and
if the maximization is restricted to a proper set of models. This last restriction was proven to be
unnecessary in the finite memory case [19].

More recently, the problem of deriving non-asymptotic bounds for the probability of incorrect
estimation was considered. In [lij ], non-universal inequalities were derived for a version of the
algorithm Context in the case of finite context trees. These results were generalized to the case of
infinite trees in [T3], and to PML estimators in [22]. Relying on recent advances in weak dependence
theory, all these results depend strongly on mixing hypotheses on the process.

For all these results, a distinction has to be made between two potential errors: under- and
over-estimation. A context of $T_0$ is said to be under-estimated if one of its proper suffixes appears
in the estimated tree $\hat{T}$, whereas it is called over-estimated if it appears as an internal node of $\hat{T}$.
Over- and under-estimation appear to be of different natures: while under-estimation is eventually 
avoided by the existence of a strictly positive distance between a process and all processes with 
strictly smaller context trees, controlling over-estimation requires bounds on the fluctuations of 
empirical processes. 

In this article, we present a unified analysis of the two families of context tree estimators. 
We contribute to a completely non-asymptotic analysis: we show that for appropriate parameters 
and measure of discrepancy, the PML estimator is always smaller than the estimator given by 
the algorithm Context. To our knowledge, this is the first result comparing these two context tree
selection methods.

Without restrictions on the (possibly infinite) context tree $T_0$, we prove that both methods
provide estimators that are with high probability sub-trees of $T_0$ (i.e., a node that is not in $T_0$ does
not appear in $\hat{T}$). These bounds are more precise and do not require the conditions assumed in
[18], [17], [22]. For this purpose, we derive "self-normalized" non-asymptotic deviation inequalities
using martingale techniques inspired by proofs of the Law of the Iterated Logarithm [24].
These inequalities prove interesting in other fields, for instance in reinforcement learning [21], [14].
On the other hand, we derive upper bounds on the probability of under-estimation by assuming
classical mixing conditions on the process generating the sample: with high probability, $\hat{T}$ contains
every node of $T_0$ at moderate height. This result is based on exponential inequalities derived for a
wider class of processes than in [3, [13, HH . 

Our upper bounds on the probability of over- and under-estimation imply strong consistency of
the PML estimators for a larger class of penalizing functions than in [22]. Similarly, in the case of
the algorithm Context, strong consistency can also be derived for suitable threshold parameters,
generalizing the convergence in probability for this estimator obtained previously in [12].

The paper is organized as follows. In Section 2 we set notation and definitions, describe in
detail the algorithms and state our main results. The proofs of these results are given in Section
3. In Section 4 we briefly discuss our results. Appendix A contains the statement and proof of the
self-normalized deviation inequalities and Appendix B is devoted to the presentation of exponential
inequalities for weakly dependent processes.



2. Notations and results 

In what follows, $A$ is a finite alphabet; its size is denoted by $|A|$. $A^j$ denotes the set of all
sequences of length $j$ over $A$; in particular, $A^0$ has only one element, the empty sequence. We
denote by $A^* = \bigcup_{k \ge 0} A^k$ the set of all finite sequences on alphabet $A$, and $A^\infty$ denotes the
set of all semi-infinite sequences $v = (\ldots, v_{-2}, v_{-1})$ of symbols in $A$. The length of the sequence
$w \in A^*$ is $|w|$. For $1 \le i \le j \le |w|$, we denote $w_i^j = (w_i, \ldots, w_j) \in A^{j-i+1}$, and $v_{-\infty}^{-1}$ denotes the
semi-infinite sequence $(\ldots, v_{-2}, v_{-1}) \in A^\infty$. Given $v \in A^* \cup A^\infty$ and $w \in A^*$, we denote by $vw$ the
sequence obtained by concatenating the two sequences $v$ and $w$. We say that the sequence $s \in A^*$
is a suffix of the sequence $w \in A^* \cup A^\infty$ if there exists a sequence $u \in A^* \cup A^\infty$ such that $w = us$.
In this case we write $w \succeq s$ or $s \preceq w$. When $|u| \ge 1$ we say that $s$ is a proper suffix of $w$ and we
write $w \succ s$ or $s \prec w$.

A set $T \subset A^* \cup A^\infty$ is a tree if no sequence $s \in T$ is a proper suffix of another sequence $w \in T$.
The height of the tree $T$ is defined as

$$h(T) = \sup\{|w| : w \in T\}.$$

If $h(T) < +\infty$ we say that $T$ is bounded and we denote by $|T|$ the cardinality of $T$. If $h(T) = +\infty$
we say that $T$ is unbounded. The elements of $T$ are also called the leaves of $T$. An internal node
of $T$ is a proper suffix of a leaf. For any sequence $w \in A^* \cup A^\infty$ and for any tree $T$, we define the
tree $T_w$ as the set of leaves in $T$ which have $w$ as a suffix, that is

$$T_w = \{u \in T : u \succeq w\}.$$

Given a tree $T$ and an integer $K$ we will denote by $T|_K$ the tree $T$ truncated to level $K$, that is

$$T|_K = \{w \in T : |w| \le K\} \cup \{w \in A^K : w \prec u \text{ for some } u \in T\}.$$

Given two trees $T_1$ and $T_2$ we say that $T_1$ is included in $T_2$ (denoted by $T_1 \preceq T_2$ or $T_2 \succeq T_1$) if for
any sequence $w \in T_1$ there exists a sequence $u \in T_2$ such that $w \preceq u$; in other words, all leaves of
$T_1$ are either leaves or internal nodes of $T_2$.
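
As a purely illustrative sketch (not part of the original text), the notions of suffix, tree, truncation and inclusion can be made concrete for bounded trees. It assumes that each sequence is stored as a Python string whose last character is the most recent symbol, so that "is a suffix of" becomes `str.endswith`; all function names are hypothetical.

```python
def is_proper_suffix(s, w):
    """True if s is a proper suffix of w (s strictly shorter than w)."""
    return len(s) < len(w) and w.endswith(s)


def is_tree(T):
    """A set of sequences is a tree if no element is a proper suffix of another."""
    return not any(is_proper_suffix(s, w) for s in T for w in T)


def height(T):
    """h(T) for a finite, non-empty set of finite sequences."""
    return max(len(w) for w in T)


def truncate(T, K):
    """T|_K: keep leaves of length <= K, cut longer leaves to their last K symbols."""
    short = {w for w in T if len(w) <= K}
    cut = {u[len(u) - K:] for u in T if len(u) > K}
    return short | cut


def is_included(T1, T2):
    """T1 is included in T2 if every leaf of T1 is a suffix of some leaf of T2."""
    return all(any(u.endswith(w) for u in T2) for w in T1)


# Example over the alphabet {0, 1}: contexts "1", "00" and "10".
T = {"1", "00", "10"}
T_cut = truncate(T, 1)
```

Here `T_cut` is the order-1 tree, which is included in `T` but not conversely.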

Consider a stationary ergodic stochastic process $\{X_t : t \in \mathbb{Z}\}$ over $A$. Given a sequence $w \in A^*$
we denote by

$$p(w) = \mathbb{P}(X_1^{|w|} = w)$$

the stationary probability of the cylinder defined by the sequence $w$. If $p(w) > 0$ we write

$$p(a|w) = \mathbb{P}(X_0 = a \mid X_{-|w|}^{-1} = w).$$

Definition 2.1. A sequence $w \in A^*$ is a finite context for the process $\{X_t : t \in \mathbb{Z}\}$ if it satisfies

1. $p(w) > 0$;

2. for any sequence $v \in A^*$ such that $p(v) > 0$ and $v \succeq w$,

$$\mathbb{P}(X_0 = a \mid X_{-|v|}^{-1} = v) = p(a|w), \quad \text{for all } a \in A;$$

3. no proper suffix of $w$ satisfies 1. and 2.

An infinite context is a semi-infinite sequence $w_{-\infty}^{-1} \in A^\infty$ such that none of its finite suffixes $w_{-j}^{-1}$,
$j = 1, 2, \ldots$ is a context. In what follows the term context will refer to a finite or infinite context.

It can be seen that the set of all contexts of the process $\{X_t : t \in \mathbb{Z}\}$ is a tree. This is called
the context tree of the process. For example, the context tree of an i.i.d. process is $A^0$ and the
context tree of a generic Markov chain of order 1 is $A^1 = A$. In what follows, we will denote by $T_0$
the context tree of the process $\{X_t : t \in \mathbb{Z}\}$.

Let $d < n$ be positive integers. Let $X_{-d+1}, \ldots, X_0, X_1, \ldots, X_n$ be a sequence distributed
according to $\mathbb{P}$. For any sequence $w \in A^*$ and any symbol $a \in A$ we denote by $N_n(w, a)$ the number
of occurrences of symbol $a$ in $X_1^n$ that are preceded by an occurrence of $w$, that is:

$$N_n(w, a) = \sum_{t=1}^{n} \mathbf{1}\{X_{t-|w|}^{t-1} = w, X_t = a\}. \quad (2.2)$$

The sum $\sum_{a \in A} N_n(w, a)$ is denoted by $N_n(w)$.

We will denote by $V_n$ the set of all sequences $w \in A^*$ that appear at least once in the sample,
that is

$$V_n = \{w \in A^* : N_n(w) \ge 1\}.$$
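
The counts in (2.2) can be gathered in one pass over the sample. The sketch below is an illustration only: it assumes the sample is given as a string holding $X_{-d+1}, \ldots, X_0, X_1, \ldots, X_n$ in order, and the function name `context_counts` is hypothetical.

```python
from collections import Counter


def context_counts(sample, d):
    """Return (N_n(w, a), N_n(w)) for every observed context w with |w| <= d.

    The first d symbols of `sample` play the role of X_{-d+1}, ..., X_0 and
    are only used as left context, never as predicted symbols.
    """
    n_wa = Counter()  # keys are pairs (w, a)
    n_w = Counter()   # keys are contexts w
    for idx in range(d, len(sample)):      # idx points at X_t, t = 1..n
        a = sample[idx]
        for k in range(d + 1):             # context length |w| = k
            w = sample[idx - k:idx]
            n_wa[(w, a)] += 1
            n_w[w] += 1
    return n_wa, n_w


# X_0 = 0 followed by X_1^4 = 0110, with maximal context length d = 1.
n_wa, n_w = context_counts("00110", d=1)
observed = set(n_w)  # the contexts of length <= d that belong to V_n
```

With this toy sample the empty context is seen four times, and each of the contexts `"0"` and `"1"` twice.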
Definition 2.3. We will say that a tree $T \subset V_n$ is acceptable if it satisfies the following conditions:

1. $h(T) \le d$; and

2. every sequence $w \in A^d$ such that $N_n(w) \ge 1$ belongs to $T$ or has a proper suffix that belongs
to $T$.

Then, our set of candidate trees, denoted by $\mathcal{T}_n$, will be the set of all acceptable trees. Our
goal is to select a tree $\hat{T} \in \mathcal{T}_n$ as close as possible to $T_0$, in a sense that will be made precise
below. Note that $d$ may depend on $n$, so that the set of candidate trees is allowed to grow with the
sample size. The symbols $X_{-d+1}, \ldots, X_0$ are only observed to ensure that, for every candidate tree
$T$, the context of $X_i$ in $T$ is well defined, for every $i = 1, \ldots, n$. Alternatively, if $X_{-d+1}, \ldots, X_0$
were not assumed observed, similar results would be obtained by using quasi-maximum likelihood
estimators [HI]. Given a tree $T \subset \bigcup_{j=1}^{d} A^j$, the maximum likelihood of the sequence $X_1, \ldots, X_n$ is
given by

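$$\hat{P}_{ML,T}(X_1^n) = \prod_{w \in T} \prod_{a \in A} \hat{p}_n(a|w)^{N_n(w,a)}, \quad (2.4)$$

where the empirical probabilities $\hat{p}_n(a|w)$ are

$$\hat{p}_n(a|w) = \frac{N_n(w,a)}{N_n(w)}$$

if $N_n(w) > 0$ and $\hat{p}_n(a|w) = 1/|A|$ otherwise. For any sequence $w \in A^*$ we define

$$\hat{P}_{ML,w}(X_1^n) = \prod_{a \in A} \hat{p}_n(a|w)^{N_n(w,a)}.$$

Hence, we have

$$\hat{P}_{ML,T}(X_1^n) = \prod_{w \in T} \hat{P}_{ML,w}(X_1^n).$$

In order to measure discrepancy between two probability measures over $A$ we use the Kullback-Leibler
divergence, defined for two probability measures $P$ and $Q$ on $A$ by

$$D(P; Q) = \sum_{a \in A} P(a) \log \frac{P(a)}{Q(a)},$$

where, by convention, $P(a) \log \frac{P(a)}{Q(a)} = 0$ if $P(a) = 0$ and $P(a) \log \frac{P(a)}{Q(a)} = +\infty$ if $P(a) > Q(a) = 0$.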
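
The two conventions in the definition of $D(P; Q)$ are easy to get wrong in an implementation. The following sketch is an illustration only: it assumes distributions are represented as dictionaries mapping symbols to probabilities, and the name `kl_divergence` is hypothetical.

```python
import math


def kl_divergence(P, Q):
    """Kullback-Leibler divergence D(P; Q) on a finite alphabet.

    Conventions: a term with P(a) = 0 contributes 0, and the divergence is
    +infinity as soon as P(a) > 0 while Q(a) = 0.
    """
    total = 0.0
    for a, pa in P.items():
        if pa == 0:
            continue
        qa = Q.get(a, 0.0)
        if qa == 0:
            return math.inf
        total += pa * math.log(pa / qa)
    return total


uniform = {"0": 0.5, "1": 0.5}
point_mass = {"0": 1.0, "1": 0.0}
```

For instance, the divergence of `point_mass` from `uniform` equals $\log 2$, while the divergence of `uniform` from `point_mass` is infinite.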

2.1. The algorithm Context 

The algorithm Context introduced by J. Rissanen in [25] computes, for each node of a given
tree, a discrepancy measure between the transition probability associated to this context and the 
corresponding transition probabilities of the nodes obtained by concatenating a single symbol to 
the context. Beginning with the largest leaves of a candidate tree, if the discrepancy measure is 
greater than a given threshold, the contexts are maintained in the tree; otherwise, they are pruned. 
The procedure continues until no more pruning of the tree can be performed. 

For all sequences $w \in V_n$ let

$$\Delta_n(w) = \sum_{b :\, bw \in V_n} N_n(bw)\, D(\hat{p}_n(\cdot|bw); \hat{p}_n(\cdot|w)).$$

Remark 2.6. We use here the original choice of divergence $\Delta_n(w)$ proposed by J. Rissanen in
[25], but other possibilities have been proposed in the literature (see for instance |2, \T3il).

We will denote the threshold used in algorithm Context on samples of length $n$ by $\delta_n$, where
$(\delta_n)_{n \in \mathbb{N}}$ is a sequence of positive real numbers such that $\delta_n \to +\infty$ and $\delta_n / n \to 0$ when $n \to +\infty$.
For a sequence $X_1^n$, let $C_w(X_1^n) \in \{0, 1\}$ be an indicator function defined for all $w \in V_n$ by the
following induction:

$$C_w(X_1^n) = \begin{cases} 0, & \text{if } N_n(w) \le 1 \text{ or } |w| = d, \\ \max\{\mathbf{1}\{\Delta_n(w) > \delta_n\}, \max_{b \in A} C_{bw}(X_1^n)\}, & \text{if } N_n(w) > 1 \text{ and } |w| < d. \end{cases} \quad (2.7)$$

With these definitions, the context tree estimator $\hat{T}_C(X_1^n)$ is the set given by

$$\hat{T}_C(X_1^n) = \{w \in V_n : C_w(X_1^n) = 0 \text{ and } C_u(X_1^n) = 1 \text{ for all } u \prec w\}. \quad (2.8)$$
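
The recursion (2.7) and the characterization (2.8) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the sample is a string whose first $d$ symbols are $X_{-d+1}, \ldots, X_0$, that the threshold is positive, and all function names are hypothetical. The indicator is recomputed at each call, which is acceptable for small $d$ only.

```python
import math
from collections import defaultdict


def count_transitions(sample, d):
    """counts[w][a] = N_n(w, a) for |w| <= d; the first d symbols are padding."""
    counts = defaultdict(lambda: defaultdict(int))
    for idx in range(d, len(sample)):
        for k in range(d + 1):
            counts[sample[idx - k:idx]][sample[idx]] += 1
    return counts


def delta_n(counts, w, alphabet):
    """Sum over observed children bw of N_n(bw) * D(p_n(.|bw); p_n(.|w))."""
    n_w = sum(counts[w].values())
    total = 0.0
    for b in alphabet:
        child = b + w
        if child not in counts:
            continue
        n_bw = sum(counts[child].values())
        for a, c in counts[child].items():
            total += c * math.log((c / n_bw) / (counts[w][a] / n_w))
    return total


def context_estimator(sample, d, alphabet, threshold):
    """Leaves w with C_w = 0 whose proper suffixes u all have C_u = 1."""
    counts = count_transitions(sample, d)

    def c_indicator(w):
        if sum(counts[w].values()) <= 1 or len(w) >= d:
            return 0
        kids = [b + w for b in alphabet if b + w in counts]
        keep = int(delta_n(counts, w, alphabet) > threshold)
        return max([keep] + [c_indicator(k) for k in kids])

    leaves, stack = set(), [""]
    while stack:  # walk down from the root through nodes with C = 1
        w = stack.pop()
        if c_indicator(w) == 0:
            leaves.add(w)
        else:
            stack.extend(b + w for b in alphabet if b + w in counts)
    return leaves


# Periodic sample 001001...: the next symbol is determined by the contexts 1, 00, 10.
sample = "001" * 101
tree = context_estimator(sample, d=2, alphabet="01", threshold=math.log(301))
```

On this deterministic sample the estimator keeps the branch below `"0"` and prunes the branch below `"1"`; with a very large threshold everything is pruned down to the root.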

2.2. The penalized maximum likelihood criterion 

The penalized maximum likelihood criterion for the sequence $X_1^n$ is defined by

$$\hat{T}_{PML}(X_1^n) = \arg\max_{T \in \mathcal{T}_n} \left\{ \log \hat{P}_{ML,T}(X_1^n) - |T| f(n) \right\}, \quad (2.9)$$

where $f(n)$ is some positive function such that $f(n) \to +\infty$ and $f(n)/n \to 0$ when $n \to \infty$.

This class of context tree estimators was first considered by Csiszár and Talata, who
introduced the Bayesian Information Criterion (BIC) for context trees and proved its consistency.
The BIC leads to the choice of the penalty function $f(n) = (|A| - 1) \log(n)/2$. It may first
appear practically impossible to compute $\hat{T}_{PML}(X_1^n)$, because the maximization in (2.9) must be
performed over the set of all candidate trees. Fortunately, Csiszár and Talata showed in their article
how to adapt the Context Tree Maximizing (CTM) method [27] in order to obtain a simple
and efficient algorithm computing $\hat{T}_{PML}(X_1^n)$. As the representation of the estimator $\hat{T}_{PML}(X_1^n)$
given by this algorithm is important for the proof of our results, we briefly present it here. Define
recursively, for any $w \in V_n$ with $|w| < d$, the value

$$V_w(X_1^n) = \max\Big\{ e^{-f(n)} \hat{P}_{ML,w}(X_1^n),\ \prod_{b \in A :\, bw \in V_n} V_{bw}(X_1^n) \Big\} \quad (2.10)$$

and the indicator

$$\mathcal{X}_w(X_1^n) = \mathbf{1}\Big\{ \prod_{b \in A :\, bw \in V_n} V_{bw}(X_1^n) > e^{-f(n)} \hat{P}_{ML,w}(X_1^n) \Big\}. \quad (2.11)$$

By convention, if $\{b \in A : bw \in V_n\} = \emptyset$ or if $|w| = d$ then $V_w(X_1^n) = e^{-f(n)} \hat{P}_{ML,w}(X_1^n)$ and
$\mathcal{X}_w(X_1^n) = 0$. As shown in Q, it holds that

$$\hat{T}_{PML}(X_1^n) = \{w \in V_n : \mathcal{X}_w(X_1^n) = 0 \text{ and } \mathcal{X}_u(X_1^n) = 1 \text{ for all } u \prec w\}. \quad (2.12)$$
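
The recursion (2.10)-(2.12) is conveniently run in the log domain, where products become sums and $e^{-f(n)}$ becomes a subtraction. The sketch below is an illustration under stated assumptions, not the algorithm as implemented by Csiszár and Talata: the sample is a string whose first $d$ symbols are $X_{-d+1}, \ldots, X_0$, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def count_transitions(sample, d):
    """counts[w][a] = N_n(w, a) for |w| <= d; the first d symbols are padding."""
    counts = defaultdict(lambda: defaultdict(int))
    for idx in range(d, len(sample)):
        for k in range(d + 1):
            counts[sample[idx - k:idx]][sample[idx]] += 1
    return counts


def pml_estimator(sample, d, alphabet, penalty):
    """Penalized maximum likelihood tree via the recursion on log V_w."""
    counts = count_transitions(sample, d)
    log_v, chi = {}, {}

    def log_ml(w):
        # log of the maximum likelihood attached to the single context w
        n_w = sum(counts[w].values())
        return sum(c * math.log(c / n_w) for c in counts[w].values())

    def visit(w):
        own = -penalty + log_ml(w)
        kids = [b + w for b in alphabet if b + w in counts]
        if len(w) == d or not kids:      # convention: V_w = e^{-f} P_ML,w, indicator 0
            log_v[w], chi[w] = own, 0
            return
        for k in kids:
            visit(k)
        split = sum(log_v[k] for k in kids)
        log_v[w], chi[w] = max(own, split), int(split > own)

    visit("")
    leaves, stack = set(), [""]
    while stack:  # walk down from the root through nodes with indicator 1
        w = stack.pop()
        if chi[w] == 0:
            leaves.add(w)
        else:
            stack.extend(b + w for b in alphabet if b + w in counts)
    return leaves


sample = "001" * 101                      # n = 301 predicted symbols when d = 2
bic_penalty = (2 - 1) * math.log(301) / 2  # f(n) = (|A| - 1) log(n) / 2
tree = pml_estimator(sample, d=2, alphabet="01", penalty=bic_penalty)
```

Each node is visited once, so the cost is linear in the number of observed contexts, in contrast with an exhaustive search over all candidate trees.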

2.3. Results 

In this subsection we present the main results of this article. First, we show that the empirical
tree given by the penalized maximum likelihood estimator is always included in the tree given by the
algorithm Context, if the threshold $\delta_n$ is not larger than the penalization function $f(n)$.

Proposition 2.13. For any $n \ge 1$ and all sequences $X_1^n$, if $\delta_n \le f(n)$ then

$$\hat{T}_{PML}(X_1^n) \preceq \hat{T}_C(X_1^n).$$

In the sequel we will assume that the cutoff sequence of the algorithm Context equals the
penalization term of the penalized maximum likelihood estimator, in order to allow a unified
treatment of the two algorithms. That is, we will assume that $\delta_n = f(n)$ for any $n \ge 1$.

We now state a new bound on the probability of over-estimation that does not require any 
mixing hypotheses on the underlying process. 
Theorem 2.14. For every $n \ge 1$ it holds that

$$\mathbb{P}\big(\hat{T}(X_1^n) \preceq T_0\big) \ge 1 - e\,(\delta_n \log(n) + |A|^2)\, n^2 \exp\Big(-\frac{\delta_n}{|A|^2}\Big), \quad (2.15)$$

where $\hat{T}(X_1^n) = \hat{T}_{PML}(X_1^n)$ or $\hat{T}(X_1^n) = \hat{T}_C(X_1^n)$.

Remark 2.16. Theorem 2.14 is proven without assuming any bound on the height of the hypothetical
trees. That is, the result remains valid even if $d = +\infty$. But if the candidate trees have only a
limited number of nodes, possibly depending on $n$ (see, e.g., fH), a straightforward modification
of the proof shows that

$$\mathbb{P}\big(\hat{T}(X_1^n) \preceq T_0\big) \ge 1 - 2e\,(\delta_n \log(n) + |A|^2)\, k(n) \exp\Big(-\frac{\delta_n}{|A|^2}\Big),$$

where $k(n)$ is the maximal number of nodes of a candidate tree. In particular, if the height of the
trees is smaller than a function $d(n)$ (possibly constant) then $k(n) = |A|^{d(n)}$.
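
To get a feeling for how large the threshold must be before the over-estimation bound becomes informative, one can evaluate it numerically. The sketch below assumes that the bound in (2.15) reads $e\,(\delta_n \log n + |A|^2)\, n^2 \exp(-\delta_n/|A|^2)$; the function name and the chosen values of $n$ and $\delta_n$ are illustrative assumptions.

```python
import math


def overestimation_bound(n, alphabet_size, delta_n):
    """Upper bound on the over-estimation probability, as read from (2.15)."""
    a2 = alphabet_size ** 2
    return math.e * (delta_n * math.log(n) + a2) * n ** 2 * math.exp(-delta_n / a2)


n = 10 ** 6
weak = overestimation_bound(n, 2, math.log(n))         # threshold log(n): vacuous
strong = overestimation_bound(n, 2, 20 * math.log(n))  # threshold 20 log(n): tiny
```

With $|A| = 2$ and $\delta_n = c \log n$ the exponential factor equals $n^{-c/4}$, so under this reading the bound only decreases with $n$ once $c > 8$.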

The problem of under-estimation in context tree models is very different, and requires additional
hypotheses on the process $\{X_t : t \in \mathbb{Z}\}$. For any $w \in A^*$ with $p(w) > 0$ define the coefficient

$$\beta(w, r) = \max_{u \in A^r} \max_{a \in A} \{|p(a|w) - p(a|uw)|\}.$$

The continuity rate of the process $\{X_t : t \in \mathbb{Z}\}$ is the sequence $\{\beta_k\}_{k \in \mathbb{N}}$ where

$$\beta_k = \max_{w \in A^k} \sup_{r \ge 1} \{\beta(w, r)\}.$$

Define also the non-nullness coefficient

$$\alpha_0 := \sum_{a \in A} \inf_{w \in T_0} \{p(a|w)\}. \quad (2.17)$$

Our underestimation error bounds will rely on the following assumption.
Assumption 1. The process $\{X_t : t \in \mathbb{Z}\}$ satisfies the following conditions

1. $\alpha_0 > 0$ (weak non-nullness) and

2. $\beta := \sum_{k \in \mathbb{N}} \beta_k < +\infty$ (summable continuity rate).

These are classical hypotheses for processes of infinite memory, which are also referred to as
chains of type A, see for instance [HJ and references therein.

To establish upper bounds for the probability of under-estimation we will consider the truncated
tree $T_0|_K$, for any given constant $K \in \mathbb{N}$. Note that in the case $T_0$ is a finite tree, $T_0|_K$ coincides
with $T_0$ for a sufficiently large constant $K$. The bounds are stated in the following theorem.

Theorem 2.18. Assume the process $\{X_t : t \in \mathbb{Z}\}$ satisfies Assumption 1. Let $K \in \mathbb{N}$ and let $d$ be
such that

$$\min_{w \prec u \in T_0|_K}\ \max_{r \le d - |w|} \{\beta(w, r)\} \ge \epsilon > 0. \quad (2.19)$$

Then, there exists $n_0 \in \mathbb{N}$ such that for any $n \ge n_0$ it holds that

„,2 r„d 8\A\df(n) ]2 



(T \k± f(X?)\ K ) > 1 - 3 e W32e 2 |A| 2 (|A|^ + 2ao)| A 



2+K exp 



16(d + l) 



(2.20) 

where $\hat{T}(X_1^n) = \hat{T}_{PML}(X_1^n)$ or $\hat{T}(X_1^n) = \hat{T}_C(X_1^n)$ and $p_{\min} = \min_{a \in A,\, w \in A^d} \{p(a|w) : p(a|w) > 0\}$.

Remark 2.21. It can be seen that for any $K \in \mathbb{N}$ there is always a value of $d$ such that (2.19)
holds. This hypothesis can be avoided by letting $d$ increase with the sample size $n$ and by controlling
the upper bounds in (2.20). Extensions of Theorem 2.18 can also be obtained by allowing $K$ to
be a function of the sample size $n$. In this case, the rate at which $K$ increases must be controlled
together with the rate at which $\epsilon$ and $p_{\min}$ decrease with the sample size. This leads to a rather
technical condition, see for instance \2a l.
Finally, the next theorem states the strong consistency of the estimators $\hat{T}_C(X_1^n)$ and $\hat{T}_{PML}(X_1^n)$
for appropriate threshold parameters and penalizing functions, respectively.

Theorem 2.22. Assume the hypotheses of Theorem 2.18 are met. Then for any threshold parameter
$(\delta_n)_{n \in \mathbb{N}}$ such that

X! cx p \~rm + lo s( 5 « lo s( n )) ) < +°° 

we have $\hat{T}_C(X_1^n)|_K = T_0|_K$ eventually almost surely as $n \to +\infty$. Similarly, if we choose $f(n) = \delta_n$
we have $\hat{T}_{PML}(X_1^n)|_K = T_0|_K$ eventually almost surely as $n \to +\infty$.



3. Proofs 

3.1. Proof of Proposition 2.13

We must prove that a leaf in $\hat{T}_{PML}(X_1^n)$ is always a leaf or an internal node in $\hat{T}_C(X_1^n)$. By the
characterization of $\hat{T}_C(X_1^n)$ and $\hat{T}_{PML}(X_1^n)$ given by equations (2.8) and (2.12), respectively, this
is equivalent to proving that $\mathcal{X}_w(X_1^n) \le C_w(X_1^n)$ for all $w \in V_n$ with $|w| < d$. In fact, assume that
$\mathcal{X}_w(X_1^n) = 1$ implies $C_w(X_1^n) = 1$, and take $w \in \hat{T}_{PML}(X_1^n)$; then, either $|w| = d$ and $w \in \hat{T}_C(X_1^n)$,
or it holds that for all $u \prec w$, $\mathcal{X}_u(X_1^n) = 1$, which implies by assumption that $C_u(X_1^n) = 1$. Now,
if $C_w(X_1^n) = 0$, then $w \in \hat{T}_C(X_1^n)$; otherwise, $w$ is a proper suffix of a sequence $v \in \hat{T}_C(X_1^n)$. In
any case, $w$ is a leaf or an internal node of $\hat{T}_C(X_1^n)$.

Assume there exists $w \in V_n$, $|w| < d$, such that $\mathcal{X}_w(X_1^n) = 1$ and $C_w(X_1^n) = 0$. Note that by
(2.7), $C_w(X_1^n) = 0$ implies $C_{uw}(X_1^n) = 0$ for all $uw \in V_n$, $|uw| < d$; hence, $w$ can be chosen such
that $\mathcal{X}_{bw}(X_1^n) = 0$ for any $bw \in V_n$, $b \in A$. In this case we have, by the definitions (2.10) and
(2.11), that

$$e^{-f(n)} \hat{P}_{ML,w}(X_1^n) < \prod_{b :\, bw \in V_n} V_{bw}(X_1^n) \quad (3.1)$$

$$= \prod_{b :\, bw \in V_n} e^{-f(n)} \hat{P}_{ML,bw}(X_1^n). \quad (3.2)$$

The equality in the second line of the last expression follows from the fact that $\mathcal{X}_{bw}(X_1^n) = 0$ for any
$bw \in V_n$, $b \in A$; therefore we must have $V_{bw}(X_1^n) = e^{-f(n)} \hat{P}_{ML,bw}(X_1^n)$ for any $bw \in V_n$, $b \in A$.

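Now, observe that for any $a \in A$, $N_n(w, a) = \sum_{b :\, bw \in V_n} N_n(bw, a)$ and $|\{b : bw \in V_n\}| \ge 2$.
If not, $N_n(w, a)$ would be equal to $N_n(cw, a)$ for some $c \in A$ and for all $a \in A$, implying that
$\hat{P}_{ML,cw}(X_1^n) = \hat{P}_{ML,w}(X_1^n)$; hence

$$\prod_{b :\, bw \in V_n} V_{bw}(X_1^n) = e^{-f(n)} \hat{P}_{ML,cw}(X_1^n) = e^{-f(n)} \hat{P}_{ML,w}(X_1^n)$$

and thus, by definition, $\mathcal{X}_w(X_1^n) = 0$. Using these facts, and taking logarithms on both sides of
Inequality (3.1), we obtain

$$\big(|\{b : bw \in V_n\}| - 1\big) f(n) < \sum_{b :\, bw \in V_n} \sum_{a \in A} N_n(bw, a) \log \frac{\hat{p}_n(a|bw)}{\hat{p}_n(a|w)} = \sum_{b :\, bw \in V_n} N_n(bw)\, D(\hat{p}_n(\cdot|bw); \hat{p}_n(\cdot|w)) = \Delta_n(w).$$

Therefore, if $\delta_n \le f(n)$ we have $\delta_n < \Delta_n(w)$, which contradicts the fact that $C_w(X_1^n) = 0$. This
concludes the proof of Proposition 2.13.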
3.2. Proof of Theorem 2.14

We will prove the result for the case $\hat{T}(X_1^n) = \hat{T}_C(X_1^n)$. The case $\hat{T}(X_1^n) = \hat{T}_{PML}(X_1^n)$ follows
straightforwardly from Proposition 2.13 and the equality $f(n) = \delta_n$.

Let $O_n$ be the event $\{\hat{T}_C(X_1^n) \not\preceq T_0\}$. Overestimation occurs if at least one internal node $w$
of $\hat{T}_C(X_1^n)$ has a (not necessarily proper) suffix $s$ in $T_0$; that is, if there exists a (possibly empty)
sequence $u$ such that $w = us$. Thus, with a slight abuse of notation, $O_n$ can be written as

$$O_n = \bigcup_{s \in T_0} \bigcup_{u \in A^*} \{\Delta_n(us) > \delta_n\}.$$

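For any sequence $w \in A^*$, the $\hat{p}_n(\cdot|w)$ are the maximum likelihood estimators of the
transition probabilities $p(\cdot|w)$; therefore

$$\begin{aligned}
\Delta_n(w) &= \sum_{b \in A} N_n(bw)\, D(\hat{p}_n(\cdot|bw); \hat{p}_n(\cdot|w)) \\
&= \sum_{b \in A} N_n(bw) \sum_{a \in A} \big( \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) - \hat{p}_n(a|bw) \log \hat{p}_n(a|w) \big) \\
&= \Big( \sum_{b \in A} N_n(bw) \sum_{a \in A} \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) \Big) - \sum_{b \in A} \sum_{a \in A} N_n(bw, a) \log \hat{p}_n(a|w) \\
&= \Big( \sum_{b \in A} N_n(bw) \sum_{a \in A} \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) \Big) - \sum_{a \in A} N_n(w, a) \log \hat{p}_n(a|w) \\
&\le \Big( \sum_{b \in A} N_n(bw) \sum_{a \in A} \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) \Big) - \sum_{a \in A} N_n(w, a) \log p(a|w) \\
&= \Big( \sum_{b \in A} N_n(bw) \sum_{a \in A} \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) \Big) - \sum_{b \in A} \sum_{a \in A} N_n(bw, a) \log p(a|w) \\
&= \sum_{b \in A} N_n(bw) \sum_{a \in A} \big( \hat{p}_n(a|bw) \log \hat{p}_n(a|bw) - \hat{p}_n(a|bw) \log p(a|w) \big) \\
&= \sum_{b \in A} N_n(bw)\, D(\hat{p}_n(\cdot|bw); p(\cdot|w)).
\end{aligned}$$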

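Hence, as for all $b \in A$ it holds that $p(\cdot|w) = p(\cdot|bw)$ whenever $w$ has a suffix in $T_0$, we obtain

$$\mathbb{P}(\Delta_n(w) > \delta_n) \le \mathbb{P}\Big( \sum_{b \in A} N_n(bw)\, D(\hat{p}_n(\cdot|bw); p(\cdot|bw)) > \delta_n \Big).$$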

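Using Theorem A.7 stated in Appendix A, it follows that

$$\begin{aligned}
\mathbb{P}(O_n) &\le \sum_{s \in T_0} \sum_{u \in A^*} \mathbb{P}(\Delta_n(us) > \delta_n) \\
&\le \sum_{s \in T_0} \sum_{u \in A^*} \mathbb{P}\Big( \sum_{b \in A} N_n(bus)\, D(\hat{p}_n(\cdot|bus); p(\cdot|bus)) > \delta_n \Big) \\
&\le \sum_{s \in T_0} \sum_{u \in A^*} \sum_{b \in A} \mathbb{P}\Big( N_n(bus)\, D(\hat{p}_n(\cdot|bus); p(\cdot|bus)) > \frac{\delta_n}{|A|} \,\Big|\, N_n(bus) > 0 \Big)\, \mathbb{P}(N_n(bus) > 0) \\
&\le 2e\,(\delta_n \log n + |A|^2) \exp\Big(-\frac{\delta_n}{|A|^2}\Big) \sum_{s \in T_0} \sum_{u \in A^*} \sum_{b \in A} \mathbb{P}(N_n(bus) > 0) \\
&\le 2e\,(\delta_n \log n + |A|^2) \exp\Big(-\frac{\delta_n}{|A|^2}\Big)\, \mathbb{E}[C_n],
\end{aligned}$$

where $C_n$ denotes the number of different contexts of the symbols in $X_1^n$. But $C_n$ is always
upper-bounded by the number $n(n-1)/2$ of (not necessarily distinct) contexts of $X_1^n$, and the result
follows.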
3.3. Proof of Theorem 2.18

We will prove the result for the case $\hat{T}(X_1^n) = \hat{T}_{PML}(X_1^n)$. The case $\hat{T}(X_1^n) =
\hat{T}_C(X_1^n)$ follows again from Proposition 2.13 and the assumption that $\delta_n = f(n)$.

If $U_n$ denotes the event $\{T_0|_K \not\preceq \hat{T}_{PML}(X_1^n)|_K\}$ then

$$U_n \subset \bigcup_{w \prec u \in T_0|_K} \{\mathcal{X}_w(X_1^n) = 0\}.$$

Let $w \prec u \in T_0|_K$. Then we have

$$\mathbb{P}(\mathcal{X}_w(X_1^n) = 0) = \mathbb{P}\Big( \prod_{a \in A :\, aw \in V_n} V_{aw}(X_1^n) \le e^{-f(n)} \hat{P}_{ML,w}(X_1^n) \Big). \quad (3.3)$$

By hypothesis, there exists $r \le d - |w|$ and $s \in A^r$ such that

$$\max_{a \in A} |p(a|w) - p(a|sw)| \ge \epsilon.$$

If s = (si . . . s r ), denote by = A \ {s;| and let T be the tree given by 

T = U[ =2 U 66Al {6s>} U • 
By definition, for any aw G V n it can be shown recursively that 

Vw(X?)=wax n e-^™)p ML ^(Xr) 

see for example Lemma 4.4 in Q. Therefore, 

P( II < e- /(n) PML,„,(XD) 

\ aeA : aweV n / 

< P ( II e_/( " )] PML,«(Xi n ) < e-^WPML^W) J (3.4) 



by noticing that 



II max J] e-^")^,.^?) > II II e ~ f(n) ^ML,v( X i. 

a£A: aw£V n " V&T' a£A: awigV„ u£r am 



> ^ I e-^PML.^X? 



tiGT 



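Applying logarithms and using that $N_n(w, a) = \sum_{u \in T} N_n(u, a)$ for any $a \in A$, we can rewrite the
probability in (3.4) and bound it as

$$\mathbb{P}\Big( \sum_{u \in T} N_n(u)\, D(\hat{p}_n(\cdot|u); \hat{p}_n(\cdot|w)) \le (|T| - 1) f(n) \Big) \le \mathbb{P}\Big( N_n(sw)\, D(\hat{p}_n(\cdot|sw); \hat{p}_n(\cdot|w)) \le (|T| - 1) f(n) \Big). \quad (3.5)$$

Define the events $A_n^{sw}$ and $B_n^{sw}$ by

$$A_n^{sw} = \{X_1^n : N_n(sw)\, D(\hat{p}_n(\cdot|sw); \hat{p}_n(\cdot|w)) \le (|T| - 1) f(n)\},$$

$$B_n^{sw} = \{X_1^n : D(\hat{p}_n(\cdot|sw); \hat{p}_n(\cdot|w)) > \epsilon^2/8\}.$$

Then we can bound above the probability in (3.5) by $\mathbb{P}(A_n^{sw} \cap B_n^{sw}) + \mathbb{P}((B_n^{sw})^c)$. To bound
the first term note that by Lemma B.11 if $n$ satisfies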

/(") e 2 p(sw) 



8(|T|-1) 
then, using the bound $|T| - 1 \le |A| r \le |A| d$ we obtain

V(A% m nB% m ) < F^N n (sw) < 8(|r| ~ 2 1)/(n) 



< e ao/ee»|A|»(|A|g+aao)M| /^ ZliMf^ Z ^I'n^ ? ) 

V \sw\ + 1 / 



On the other hand, by Lemma B.13 we have

p(W) c ) < 2e a "/ 32e2 i A i 2 (i A i^+ 2 "°)(l^l + ^^[-" le^itV 

We conclude the proof of Theorem 2.18 by observing that we only have a finite number of sequences
$w \prec u \in T_0|_K$, therefore we obtain



nUn) < 3e a °/ 32e2 l A l 2 (l A ^+ 2a °)|A| 2+ *exp 



2r„d 8|A|rf/(n) l2 ' 



716 "-mm 



16(d+l) 



3.4. Proof of Theorem 2.22

The statement of the theorem follows straightforwardly from Theorems 2.14 and 2.18 and the
Borel-Cantelli Lemma, by noticing that the upper bounds for

$$\mathbb{P}\big(\hat{T}_C(X_1^n)|_K \ne T_0|_K\big) \le \mathbb{P}\big(\hat{T}_C(X_1^n)|_K \not\preceq T_0|_K\big) + \mathbb{P}\big(T_0|_K \not\preceq \hat{T}_C(X_1^n)|_K\big)$$

are summable in $n$. The same reasoning applies to $\hat{T}_{PML}(X_1^n)$ when $f(n) = \delta_n$.
4. Discussion 

In this paper we showed a relation between two classical algorithms for context tree selection.
We proved that for a proper set of parameters, the Penalized Maximum Likelihood estimator
always yields a smaller tree than the tree given by the algorithm Context. This relation between
the empirical context trees allows us to derive, in a unified way, non-asymptotic bounds for the
probability of over- and under-estimation of the context tree generating the sample. The tree may
be unbounded, and our results apply to processes that do not necessarily have a finite memory.

Concerning under-estimation, we assume the process satisfies some conditions that imply
exponential inequalities for the empirical probabilities. These inequalities were obtained in [17]
under a stronger non-nullness assumption; namely, that the transition probabilities were lower
bounded by a positive constant. In this paper we show that the results also hold for a larger
class of processes. It is conjectured that similar results cannot be obtained without any
non-nullness or mixing condition on the process.

Concerning over-estimation, no mixing assumption is necessary for Theorem 2.14 to hold. This
improves on and generalizes the results obtained in [17], [22]. Our proof is based on deviation
inequalities obtained for the empirical Kullback-Leibler divergence, instead of the $L^p$ norm; it appears that
this pseudo-metric is more intrinsic for binomial distributions (and partially also for multinomial
distributions), as the binary Kullback-Leibler divergence is the rate function of a Large Deviations
Principle. Deriving similar inequalities is also possible for other distributions and thus other
pseudo-metrics, or by using upper bounds of the Legendre transform of the distribution, as in [21].
These types of inequalities are interesting in their own right and prove useful in various settings: other
applications of similar bounds may be found in [21, 14, 20].



From the point of view of most applications, over- and under-estimation play different roles.
In fact, data-generating processes often cannot be assumed to have finite memory: the whole
dependence structure cannot be recovered from finitely many observations and under-estimation
is unavoidable. All that can be expected from the estimator is to highlight evidence of as much
dependence structure as possible, while maintaining a limited probability of false discovery.

Our results imply the strong consistency of the algorithm Context for processes of infinite
memory, generalizing the convergence in probability of this estimator previously obtained in [12].
Likewise, the strong consistency of the PML estimator is also derived for a larger class of penalizing
functions than in [22].
Appendix A: Martingale deviation inequalities

This section contains the statement and derivation of two deviation inequalities that are useful
to prove the results of this paper. As they are interesting in their own right, we include them in a
separate section. The ingredients of the proofs are mostly inspired by [24], see also Q.

We briefly recall some notation so as to keep this section self-contained. Let $(X_n)_{n \in \mathbb{Z}}$ be a
stationary process whose (possibly infinite) context tree is $T_0$, and let $\mathcal{F}_n$ be the $\sigma$-field generated
by $(X_j)_{j \le n}$. For $k \in \mathbb{N}$, $w \in A^k$, denote $p(b|w) = \mathbb{P}(X_{k+1} = b \mid X_1^k = w)$. For $j \ge 1$, define

$$\xi_j = \mathbf{1}\{X_{j-k}^{j-1} = w\} \quad \text{and} \quad \chi_j = \mathbf{1}\{X_{j-k}^{j} = wb\},$$

so that $N_n(w) = \sum_{j=1}^{n} \xi_j$ and $N_n(w, b) = \sum_{j=1}^{n} \chi_j$. Denote $\hat{p}_n(b|w) = N_n(w, b)/N_n(w)$. The
Kullback-Leibler divergence between Bernoulli variables will be denoted by $d$: for all $p, q \in [0, 1]$,

$$d(p; q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}.$$

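Proposition A.1. Let $k$ be a positive integer, let $w \in A^k$ and let $b \in A$. Then for any $\delta > 0$

$$\mathbb{P}\big[N_n(w)\, d(\hat{p}_n(b|w); p(b|w)) > \delta\big] \le 2e\, \lceil \delta \log(n) \rceil \exp(-\delta).$$

Proof. Denote $p = p(b|w)$, $N_n = N_n(w)$, $S_n = N_n(w, b)$ and $\hat{p}_n = S_n/N_n$. For every $\lambda > 0$, let

$$\phi_p(\lambda) = \log \mathbb{E}[\exp(\lambda X_1)] = \log(1 - p + p \exp(\lambda)).$$

Let also $W_0^\lambda = 1$ and for $t \ge 1$,

$$W_t^\lambda = \exp(\lambda S_t - N_{t-1} \phi_p(\lambda)).$$

First, note that $(W_t^\lambda)_{t \ge 0}$ is a martingale relative to $(\mathcal{F}_t)_{t \ge 0}$ with expectation $\mathbb{E}[W_0^\lambda] = 1$. In fact,

$$\mathbb{E}[\exp(\lambda (S_{t+1} - S_t)) \mid \mathcal{F}_t] = \mathbb{E}[\exp(\lambda \chi_{t+1}) \mid \mathcal{F}_t] = \exp(\xi_t \phi_p(\lambda)) = \exp((N_t - N_{t-1}) \phi_p(\lambda)),$$

which can be rewritten as

$$\mathbb{E}[\exp(\lambda S_{t+1} - N_t \phi_p(\lambda)) \mid \mathcal{F}_t] = \exp(\lambda S_t - N_{t-1} \phi_p(\lambda)).$$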



To proceed, we make use of the so-called 'peeling trick' [23]: we divide the interval $\{1, \ldots, n\}$
of possible values for $N_n$ into "slices" $\{t_{k-1} + 1, \ldots, t_k\}$ of geometrically increasing size, and treat
the slices independently. We may assume that $\delta > 1$, since otherwise the bound is trivial. Take
$\eta = 1/(\delta - 1)$, let $t_0 = 0$ and for $k \in \mathbb{N}^*$, let $t_k = \lfloor (1 + \eta)^k \rfloor$. Let $m$ be the first integer such that
$t_m \ge n$, that is

$$m = \Big\lceil \frac{\log n}{\log(1 + \eta)} \Big\rceil.$$
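
The peeling grid is easy to tabulate, which makes the role of $\eta$ and $m$ more tangible. The sketch below is an illustration only; the function name `peeling_grid` and the example values $n = 1000$, $\delta = 3$ are assumptions.

```python
import math


def peeling_grid(n, delta):
    """Return (eta, m, [t_0, ..., t_m]) for the geometric slicing of {1, ..., n}."""
    if delta <= 1:
        raise ValueError("the peeling argument assumes delta > 1")
    eta = 1.0 / (delta - 1.0)
    m = math.ceil(math.log(n) / math.log(1.0 + eta))
    grid = [0] + [math.floor((1.0 + eta) ** k) for k in range(1, m + 1)]
    return eta, m, grid


eta, m, grid = peeling_grid(n=1000, delta=3.0)
```

With these values $\eta = 1/2$ and the last grid point is the first one to reach $n$. The number of slices stays below $\lceil \delta \log n \rceil$, which is where the logarithmic factor in Proposition A.1 comes from.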



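Let $A_k = \{t_{k-1} < N_n \le t_k\} \cap \{N_n d(\hat{p}_n; p) > \delta\}$. We have:

$$\mathbb{P}(N_n d(\hat{p}_n; p) > \delta) \le \mathbb{P}\Big( \bigcup_{k=1}^{m} A_k \Big) \le \sum_{k=1}^{m} \mathbb{P}(A_k). \quad (A.2)$$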



We upper-bound the probability of $A_k \cap \{\hat{p}_n > p\}$; the same arguments can easily be transposed
to left deviations. Let $s$ be the smallest integer such that $\delta/(s + 1) \le d(1; p)$; if $N_n \le s$, then
$N_n d(\hat{p}_n, p) \le s\, d(\hat{p}_n, p) \le s\, d(1, p) < \delta$ and $\mathbb{P}(N_n d(\hat{p}_n, p) > \delta, \hat{p}_n > p) = 0$. Thus, $\mathbb{P}(A_k) = 0$ for all
$k$ such that $t_k \le s$.

Take $k$ such that $t_k > s$, and let $\tilde{t}_{k-1} = \max\{t_{k-1}, s\}$. Let $x \in ]p, 1]$ be such that $d(x; p) = \delta/N_n$,
and let $\lambda(x) = \log(x(1 - p)) - \log(p(1 - x))$, so that $d(x; p) = \lambda(x) x - \phi_p(\lambda(x))$. Let $z$ be such that
$z \ge p$ and $d(z, p) = \delta/(1 + \eta)^k$. Observe that:
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• if N n > tk~i, then 



• if N n < tk then, as 



we have : 



d{Z;p) = rr- > 



d{p n ;p) > Tf > 



(l + v ) k ~ {l + r))N n ' 
d(z;p), 



N n (1 + 7 ? ) fc 



p n > p and d(p n ; p) > — =>• p n > z. 



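Hence, on the event $\{\tilde{t}_{k-1} < N_n \le t_k\} \cap \{\hat{p}_n \ge p\} \cap \{d(\hat{p}_n;p) > \delta/N_n\}$ it holds that

$$\lambda(z)\,\hat{p}_n - \phi_p(\lambda(z)) \ge \lambda(z)\,z - \phi_p(\lambda(z)) = d(z;p) \ge \frac{\delta}{(1+\eta) N_n}.$$

Putting everything together,

$$\begin{aligned}
\{\tilde{t}_{k-1} < N_n \le t_k\} \cap \{\hat{p}_n \ge p\} \cap \left\{d(\hat{p}_n;p) > \frac{\delta}{N_n}\right\}
&\subset \left\{\lambda(z)\,\hat{p}_n - \phi_p(\lambda(z)) \ge \frac{\delta}{(1+\eta) N_n}\right\} \\
&\subset \left\{\lambda(z)\,S_n - N_n\,\phi_p(\lambda(z)) \ge \frac{\delta}{1+\eta}\right\} \\
&\subset \left\{W_n^{\lambda(z)} \ge \exp\left(\frac{\delta}{1+\eta}\right)\right\}.
\end{aligned}$$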

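As $(W_t^{\lambda})_{t \ge 0}$ is a martingale, $\mathbb{E}\big[W_n^{\lambda(z)}\big] = \mathbb{E}\big[W_0^{\lambda(z)}\big] = 1$, and the Markov inequality yields:

$$\mathbb{P}\left(\{\tilde{t}_{k-1} < N_n \le t_k\} \cap \{\hat{p}_n \ge p\} \cap \{N_n\, d(\hat{p}_n;p) > \delta\}\right) \le \mathbb{P}\left(W_n^{\lambda(z)} \ge \exp\left(\frac{\delta}{1+\eta}\right)\right) \le \exp\left(-\frac{\delta}{1+\eta}\right). \tag{A.3}$$

Similarly,

$$\mathbb{P}\left(\{\tilde{t}_{k-1} < N_n \le t_k\} \cap \{\hat{p}_n \le p\} \cap \{N_n\, d(\hat{p}_n;p) > \delta\}\right) \le \exp\left(-\frac{\delta}{1+\eta}\right),$$

so that

$$\mathbb{P}\left(\{\tilde{t}_{k-1} < N_n \le t_k\} \cap \{N_n\, d(\hat{p}_n,p) > \delta\}\right) \le 2\exp\left(-\frac{\delta}{1+\eta}\right).$$

Finally, by Equation (A.2),

$$\mathbb{P}\left(\bigcup_{k=1}^{m} \{\tilde{t}_{k-1} < N_n \le t_k\} \cap \{N_n\, d(\hat{p}_n,p) > \delta\}\right) \le 2m \exp\left(-\frac{\delta}{1+\eta}\right).$$

But as $\eta = 1/(\delta - 1)$,

$$m = \left\lceil \frac{\log n}{\log\left(1 + 1/(\delta-1)\right)} \right\rceil,$$

and as $\log(1 + 1/(\delta-1)) \ge 1/\delta$, we obtain:

$$\mathbb{P}\left(N_n\, d(\hat{p}_n,p) > \delta\right) \le 2 \left\lceil \frac{\log n}{\log\left(1 + \frac{1}{\delta-1}\right)} \right\rceil \exp(-\delta + 1) \le 2e \lceil \delta \log(n) \rceil \exp(-\delta).$$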



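□

Remark A.4. The bound of Proposition A.1 also holds for $\mathbb{P}(N_n\, d(\hat{p}_n,p) > \delta \mid N_n > 0)$: in fact, as

$$\begin{aligned}
1 = \mathbb{E}\big[W_n^{\lambda(z)}\big] &= \mathbb{E}\big[W_n^{\lambda(z)} \mid N_n > 0\big]\, \mathbb{P}(N_n > 0) + \mathbb{E}\big[W_n^{\lambda(z)} \mid N_n = 0\big]\, \mathbb{P}(N_n = 0) \\
&= \mathbb{E}\big[W_n^{\lambda(z)} \mid N_n > 0\big]\, \mathbb{P}(N_n > 0) + 1 - \mathbb{P}(N_n > 0),
\end{aligned}$$

it follows that $\mathbb{E}\big[W_n^{\lambda(z)} \mid N_n > 0\big] = 1$ and starting from Equation (A.3) the proof can be rewritten conditionally on $\{N_n > 0\}$; this leads to:

$$\mathbb{P}\left(N_n\, d(\hat{p}_n,p) > \delta \mid N_n > 0\right) \le 2e \lceil \delta \log(n) \rceil \exp(-\delta).$$

However, in general no such result can be proved for $\mathbb{P}(N_n\, d(\hat{p}_n,p) > \delta \mid N_n > k)$ for positive values of $k$.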
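The last step of the proof of Proposition A.1, which passes from the intermediate bound $2m\exp(-\delta/(1+\eta))$ to $2e\lceil\delta\log n\rceil\exp(-\delta)$, can be checked numerically. The sketch below is only an illustration under the choices made in the proof ($\delta > 1$, $\eta = 1/(\delta-1)$, $m = \lceil \log n/\log(1+\eta)\rceil$); the function names are hypothetical.

```python
import math


def peeling_bound(delta: float, n: int) -> float:
    """Intermediate bound 2 m exp(-delta / (1 + eta)) from the peeling step."""
    if delta <= 1.0 or n < 2:
        raise ValueError("requires delta > 1 and n >= 2")
    eta = 1.0 / (delta - 1.0)
    # m is the number of geometric slices needed to cover {1, ..., n}.
    m = math.ceil(math.log(n) / math.log1p(eta))
    return 2.0 * m * math.exp(-delta / (1.0 + eta))


def stated_bound(delta: float, n: int) -> float:
    """Final bound 2 e ceil(delta log n) exp(-delta) of Proposition A.1."""
    return 2.0 * math.e * math.ceil(delta * math.log(n)) * math.exp(-delta)
```

Since $\delta/(1+\eta) = \delta - 1$ and $\log(1+\eta) \ge 1/\delta$, the first quantity never exceeds the second, up to floating-point rounding.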

To proceed, we need the following lemma: 
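Lemma A.5. For any probability distributions $P$ and $Q$ on the finite alphabet $A$,

$$D(P;Q) \le \sum_{x \in A} d\big(P(x);Q(x)\big).$$

Proof.

$$\begin{aligned}
\sum_{x \in A} d\big(P(x);Q(x)\big) - D(P;Q) &= \sum_{x \in A} (1 - P(x)) \log\frac{1 - P(x)}{1 - Q(x)} \\
&= (|A| - 1) \sum_{x \in A} \frac{1 - P(x)}{|A| - 1} \log\frac{(1 - P(x))/(|A| - 1)}{(1 - Q(x))/(|A| - 1)} \\
&\ge 0,
\end{aligned}$$

because the sum in the next-to-last line is the Kullback-Leibler divergence between the probability distributions $R$ and $S$ defined on $A$ by:

$$R(x) = \frac{1 - P(x)}{|A| - 1} \quad \text{and} \quad S(x) = \frac{1 - Q(x)}{|A| - 1}.$$

□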
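The inequality of Lemma A.5, $D(P;Q) \le \sum_{x \in A} d(P(x);Q(x))$, is easy to probe numerically. The following sketch draws random strictly positive distributions (an assumption made only to avoid the $0\log 0$ conventions) and evaluates the gap between the two sides; all function names are hypothetical.

```python
import math
import random


def kl(P, Q):
    """Kullback-Leibler divergence D(P; Q) for strictly positive Q."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0.0)


def bernoulli_kl(p, q):
    """Bernoulli divergence d(p; q) for 0 < q < 1."""
    return kl([p, 1.0 - p], [q, 1.0 - q])


def random_distribution(size, rng):
    """Hypothetical helper: a random strictly positive probability vector."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [x / total for x in weights]


def lemma_gap(P, Q):
    """sum_x d(P(x); Q(x)) - D(P; Q); Lemma A.5 states it is non-negative."""
    return sum(bernoulli_kl(p, q) for p, q in zip(P, Q)) - kl(P, Q)
```

For a binary alphabet the sum of the two Bernoulli divergences is exactly $2D(P;Q)$, so `lemma_gap(P, Q)` equals `kl(P, Q)`.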



Remark A.6. Obviously, this lemma is suboptimal for $|A| = 2$ by a factor 2. For larger alphabets, it does not appear possible to improve on this bound for all $P$ and $Q$.

We are now in position to state the deviation result we use in order to upper bound the 
probability of over-estimation: 

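Theorem A.7. Let $k$ be a positive integer and let $w \in A^k$. Then, for any $\delta > 0$,

$$\mathbb{P}\left[N_n(w)\, D\big(\hat{p}_n(\cdot|w); p(\cdot|w)\big) > \delta\right] \le 2e \left(\delta \log(n) + |A|\right) \exp\left(-\frac{\delta}{|A|}\right).$$

Proof. By combining Lemma A.5 and Proposition A.1 we get

$$\begin{aligned}
\mathbb{P}\left[N_n(w)\, D\big(\hat{p}_n(\cdot|w); p(\cdot|w)\big) > \delta\right]
&\le \mathbb{P}\left[\sum_{b \in A} N_n\, d\big(\hat{p}_n(b|w); p(b|w)\big) > \delta\right] \\
&\le \sum_{b \in A} \mathbb{P}\left[N_n\, d\big(\hat{p}_n(b|w); p(b|w)\big) > \frac{\delta}{|A|}\right] \\
&\le 2|A|e \left\lceil \frac{\delta}{|A|} \log(n) \right\rceil \exp\left(-\frac{\delta}{|A|}\right) \\
&\le 2|A|e \left(\frac{\delta}{|A|} \log(n) + 1\right) \exp\left(-\frac{\delta}{|A|}\right) \\
&= 2e \left(\delta \log(n) + |A|\right) \exp\left(-\frac{\delta}{|A|}\right).
\end{aligned}$$

□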

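Remark A.8. It follows from Remark A.4 that the following variant of Theorem A.7 holds:

$$\mathbb{P}\left[N_n(w)\, D\big(\hat{p}_n(\cdot|w); p(\cdot|w)\big) > \delta \mid N_n(w) > 0\right] \le 2e \left(\delta \log(n) + |A| - 1\right) \exp\left(-\frac{\delta}{|A| - 1}\right).$$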
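To get a feeling for the orders of magnitude in Theorem A.7, the bound $2e(\delta\log n + |A|)\exp(-\delta/|A|)$ and the union bound it is derived from ($|A|$ applications of Proposition A.1 at level $\delta/|A|$) can be evaluated as below. This is a sketch with hypothetical function names; the argument `alphabet_size` stands for $|A|$.

```python
import math


def proposition_a1_bound(delta: float, n: int) -> float:
    """Bound 2 e ceil(delta log n) exp(-delta) of Proposition A.1."""
    return 2.0 * math.e * math.ceil(delta * math.log(n)) * math.exp(-delta)


def union_bound(delta: float, n: int, alphabet_size: int) -> float:
    """Sum over the alphabet of Proposition A.1 applied at level delta / |A|."""
    return alphabet_size * proposition_a1_bound(delta / alphabet_size, n)


def theorem_a7_bound(delta: float, n: int, alphabet_size: int) -> float:
    """Bound 2 e (delta log n + |A|) exp(-delta / |A|) of Theorem A.7."""
    return (2.0 * math.e * (delta * math.log(n) + alphabet_size)
            * math.exp(-delta / alphabet_size))
```

Both quantities exceed 1 for small $\delta$; the bound only becomes informative once $\delta/|A|$ is large compared with $\log(\delta \log n)$.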



Appendix B: Exponential inequalities for weak dependent processes 

In this section we state some results providing exponential inequalities for processes satisfying Assumption 1 and prove two lemmas that are useful in the proof of Theorem 2.18. The first result is a version of Theorem 3.1 in [17] that we state under weaker conditions, given by Assumption 1.

Proposition B.9. Assume the process $\{X_t : t \in \mathbb{Z}\}$ satisfies Assumption 1. Then for any $w \in A^*$, any $a \in A$ and any $t > 0$ the following inequality holds

F(\N n (w,a)-np(wa)\ > t) < e «o/8e 2 (|A|/3+2 ao ) exp ( -lL- \ . 

V \wa\n J 

Proof. Theorem 3.1 in [17] was proven for a process satisfying a stronger non-nullness hypothesis than our Assumption 1, namely that $\inf_{w} \{p(a|w)\} > 0$ for any $a \in A$. But the proof of the theorem is based on results obtained in [7] and [11] that also hold for processes satisfying our weaker assumption. Moreover, the upper bound in Theorem 3.1 in [17] depends on the coefficient

$$\alpha := \sum_{k \ge 0} (1 - \alpha_k),$$

where for $k \ge 1$

$$\alpha_k := \inf_{u \in A^k} \sum_{a \in A} \inf_{x} p(a \mid x\,u).$$

But it can be shown that for any $k \ge 1$ we have $1 - \alpha_k \le |A|\beta_k$, as noted by in their proof of Lemma 3. Therefore $\alpha \le |A|\beta + \alpha_0$ and Theorem 3.1 in [17] takes the form of Proposition B.9. □

As a consequence of this result we have the following lemma, proven in [22, Corollary A.7].

Lemma B.10. Assume the process $\{X_t : t \in \mathbb{Z}\}$ satisfies Assumption 1. Then for any $w \in A^*$, any $a \in A$ and any $t > 0$ the following inequality holds

¥(\p n (a\w)-p(a\w)\ > t) < e W32eVl 2 (l^+2a o)(|A| + 1)cx ^f -"^M 2 \ _ 

V |k;| + 1 / 

Now, we prove Lemmas B.11 and B.13 below. These two results are useful in the proof of Theorem 2.18.

Lemma B.11. Assume the process $\{X_t : t \in \mathbb{Z}\}$ satisfies Assumption 1. Then for any $w \in A^*$ and any $t > 0$ such that $t < np(w)$ we have

P(NJw) <t)< e a °/ 8e2 W 2 ^+ 2 ^\A\exp( ~ nl f {w) ~ . (B.12) 

V |iu| + 1 / 

Proof. Using that $N_n(w) = \sum_{a \in A} N_n(w,a)$, $p(w) = \sum_{a \in A} p(wa)$ and $t - np(w) < 0$ we have that

$$\begin{aligned}
\mathbb{P}(N_n(w) < t) &= \mathbb{P}\left(\sum_{a \in A} \left[N_n(w,a) - np(wa)\right] < t - np(w)\right) \\
&\le \sum_{a \in A} \mathbb{P}\left(|N_n(w,a) - np(wa)| > \frac{np(w) - t}{|A|}\right).
\end{aligned}$$

Using Proposition B.9 we can bound above the right hand side of the last inequality by

e«°/«AW^)li[ ffip[ - M^| ], 

+ ljn 

This implies the bound in (B.12). □

Lemma B.13. Assume the process $\{X_t : t \in \mathbb{Z}\}$ satisfies Assumption 1. Let $u, w \in A^*$ and $b \in A$ such that $p(b|u) - p(b|w) > 0$. Then, for any $t < [p(b|u) - p(b|w)]^2/8$ we have that

P(D(p„(»;p„(») < t) < 2e-/ 32e2 l^l^+ 2 -)(|A^^ 
v ' L 2 V \w\ + 1 \u\ + 1/ 

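Proof. By Pinsker's inequality (see, e.g., [@, Section A.2] for a proof) we have that

$$D\big(\hat{p}_n(\cdot|u); \hat{p}_n(\cdot|w)\big) \ge \frac{1}{2} \left[\sum_{a \in A} |\hat{p}_n(a|u) - \hat{p}_n(a|w)|\right]^2. \tag{B.14}$$

Now, set $v = \frac{1}{8}[p(b|u) - p(b|w)]^2$ and define the event

$$C_n^{u,w} = \left\{X_1^n : |\hat{p}_n(b|u) - p(b|u)| \le \sqrt{v/2}\right\} \cap \left\{X_1^n : |\hat{p}_n(b|w) - p(b|w)| \le \sqrt{v/2}\right\}. \tag{B.15}$$

Then, if $t < v$ we have that the event

$$\left\{X_1^n : D\big(\hat{p}_n(\cdot|u); \hat{p}_n(\cdot|w)\big) < t\right\} \cap C_n^{u,w} = \emptyset.$$

To see this note that by (B.14), if (B.15) holds then

$$D\big(\hat{p}_n(\cdot|u); \hat{p}_n(\cdot|w)\big) \ge \frac{1}{2} \left[\frac{p(b|u) - p(b|w)}{2}\right]^2 = v > t.$$

Therefore, using the bounds in Lemma B.10 we obtain for any $t < v$ that

$$\mathbb{P}\left(D\big(\hat{p}_n(\cdot|u); \hat{p}_n(\cdot|w)\big) < t\right) \le \mathbb{P}\left(|\hat{p}_n(b|u) - p(b|u)| > \sqrt{v/2}\right) + \mathbb{P}\left(|\hat{p}_n(b|w) - p(b|w)| > \sqrt{v/2}\right)$$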



< 2e Qo / 32e2 ' /l ' 2( ' /l ' /3+2Qo) (| J 4| + 1) 



exp 



v . ( p(w) 2 p{u) 2 \ 

-n — mm - — ; , - — ; 

2 VM + l'ki+1/ 



□ 
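The two elementary ingredients of this proof can be illustrated numerically: Pinsker's inequality $D(P;Q) \ge \frac{1}{2}\big(\sum_a |P(a) - Q(a)|\big)^2$, and the arithmetic behind the threshold $[p(b|u) - p(b|w)]^2/8$. The sketch below assumes, as a reading of the argument, that each empirical probability of $b$ deviates from its true value by at most a quarter of the gap $p(b|u) - p(b|w)$, so that the empirical gap is at least half of the true one. Function names are hypothetical.

```python
import math
import random


def kl(P, Q):
    """Kullback-Leibler divergence D(P; Q) for strictly positive Q."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0.0)


def l1_distance(P, Q):
    """Sum over the alphabet of |P(a) - Q(a)|."""
    return sum(abs(p - q) for p, q in zip(P, Q))


def pinsker_slack(P, Q):
    """D(P; Q) - (1/2) * L1(P, Q)^2, non-negative by Pinsker's inequality."""
    return kl(P, Q) - 0.5 * l1_distance(P, Q) ** 2


def separation_level(gap: float) -> float:
    """Threshold gap^2 / 8 appearing in the statement of Lemma B.13."""
    return gap ** 2 / 8.0


def random_distribution(size, rng):
    """Hypothetical helper: a random strictly positive probability vector."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [x / total for x in weights]
```

With `gap = 0.4`, an empirical gap of at least `gap / 2` gives a divergence of at least `0.5 * (gap / 2) ** 2`, which coincides with `separation_level(gap)`.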